Abstract: Fault-tolerance techniques for stream processing
engines can be categorized into passive and active approaches. 
A typical passive approach periodically checkpoints a processing 
task’s runtime states and can recover a failed task by restoring its 
runtime state using its latest checkpoint. On the other hand, an 
active approach usually employs backup nodes to run replicated 
tasks. Upon failure, the active replica can take over the processing 
of the failed task with minimal latency. However, both approaches
have their own inadequacies in Massively Parallel Stream Processing
Engines (MPSPE). The passive approach incurs a long
recovery latency especially when a number of correlated nodes 
fail simultaneously, while the active approach requires extra 
replication resources. In this paper, we propose a new fault- 
tolerance framework, which is Passive and Partially Active (PPA). 
In a PPA scheme, the passive approach is applied to all tasks 
while only a selected set of tasks will be actively replicated. 
The number of actively replicated tasks depends on the available 
resources. If tasks without active replicas fail, tentative outputs 
will be generated before the completion of the recovery process. 
We also propose effective and efficient algorithms to optimize 
a partially active replication plan to maximize the quality of 
tentative outputs. We implemented PPA on top of Storm, an
open-source MPSPE, and conducted extensive experiments using
both real and synthetic datasets to verify the effectiveness of our 
approach. 

I. Introduction 

There is a recently emerging interest in building Massively 
Parallel Stream Processing Engines (MPSPE), such as Storm
[24] and Spark Streaming, which make use of large-scale
computing clusters to process continuous queries over fast 
data streams. Such continuous queries often run for a very 
long time and would unavoidably experience various system 
failures, especially in a large-scale cluster. As it is critical to 
provide continuous query results without significant downtime 
in many data stream applications, fault-tolerance techniques in 
Stream Processing Engines (SPEs) have attracted
a lot of attention. 

Existing fault-tolerance techniques for SPEs can be generally
categorized as passive and active approaches [13]. In a
typical passive approach, the runtime states of tasks will be 
periodically extracted as checkpoints and stored at different 
locations. Upon failure, the state of a failed task can be restored 
from its latest checkpoint. While one can in general tune the 
checkpoint frequency to achieve trade-offs between the cost of 
checkpoint and the recovery latency, the checkpoint frequency 
should be limited to avoid high checkpoint overhead, which 
affects the system performance. Hence recovery latency is 
usually significant in a passive approach. When one wants 
to minimize the recovery latency as much as possible, it is
often more efficient to use an active approach, which typically 
uses one backup node to replicate the tasks running on each 
processing node. When a node fails, its backup node can 
quickly take over with minimal latency. 

Even though there are abundant fault-tolerance techniques
for SPEs, building fault tolerance into an MPSPE [24] poses new
challenges. First of all, in a large cluster, there are
often two different types of failures: independent failures and
correlated failures. Previous studies mostly focused
on independent failures that happen at a single node. Correlated
failures are usually caused by failures of switches, routers 
and power facilities, and will involve a number of nodes 
failing simultaneously. With such failures, one has to recover 
a large number of failed tasks and temporarily run them on 
an additional set of standby nodes before the failed ones are 
recovered. Using a passive fault-tolerance approach, one has
to keep the standby nodes running, even though their utilization is low
most of the time, in order to avoid the unacceptable overhead of
starting them at recovery time. Furthermore, as checkpoints of
different nodes are often created asynchronously, massive
synchronization has to be performed during recovery. Therefore,
it could be difficult to meet the user requirements on recovery 
latency even with a relatively high checkpoint frequency. 

On the other hand, while an active fault-tolerance approach 
can achieve a lower recovery latency, it could be too costly 
for a large-scale computation. For a large-scale stream
computation that is parallelized onto 100 nodes, one may
not be able to afford another 100 backup nodes for active
replication.

Another challenge is that some time-critical applications
prefer query outputs to be generated in good
time even if the outputs are computed based on incomplete
inputs. Such applications usually require continuous
query output for timely decision-making or visualization.
Consider a community-based navigation service,
which collects and aggregates user-contributed traffic data in
real time and then continuously provides navigation
suggestions to the users. Failure of some processing nodes
could result in losing some user-contributed data. The system,
while waiting for the failed nodes to recover, can continue
to help drivers plan their routes based on the incomplete
inputs. Other examples of such applications include intrusion
detection and online visualization of real-time data streams.
Alerts of events matching intrusion attack patterns, or
infographics generated over incomplete inputs, are still meaningful
to the users and should be generated without any major delay.
Considering the long recovery latency for a large-scale correlated
failure, a scheme that offers no trade-off between recovery latency and
result quality would not be able to fulfill the requirements of
these applications.

To address the aforementioned challenges, we propose a 
new fault-tolerance scheme for MPSPEs, which is Passive 
and Partially Active (PPA). In a PPA scheme, a number 
of standby nodes will be used to prepare for recoveries 
from both independent and correlated failures. Checkpoints 
of the processing nodes will be stored at the standby nodes 
periodically. Rather than keeping them mostly idle, as in a
purely passive approach, we opportunistically employ them
for active replications for a selected subset of the running 
tasks. In this way, we can provide very fast recovery for the 
tasks with active replicas. Furthermore, when the failed tasks
include some without active replicas, PPA provides tentative
outputs with quality as high as possible. The results can then
be rectified after the passive recovery process has finished,
using techniques similar to those proposed in [13]. In general, PPA is
more flexible in utilizing the available resources than a purely 
active approach, and in the meantime can provide tentative 
outputs with a higher quality than a purely passive one. 

In this paper, we focus on optimizing the use of available
resources for active replication in PPA, i.e. deciding which
tasks should be included for active replication. In summary, 
we have made the following contributions in this paper: 

(1) We present PPA, a passive but partially active fault-tolerance
scheme for an MPSPE.

(2) As existing MPSPEs often involve user-defined functions
whose semantics are not easily available to the system,
we propose a simple yet effective metric, referred to as output
fidelity, to estimate the quality of the tentative outputs.

(3) We propose an optimal dynamic programming algorithm
and several heuristic algorithms to determine which
tasks to actively replicate for a given query topology.

(4) We implement our approach in an open-source MPSPE,
namely Storm [24], and perform an extensive experimental
study on an Amazon EC2 cluster using both real and synthetic
datasets. The results suggest that by adopting PPA, the
accuracy of tentative outputs is significantly improved with
a limited amount of replication resources.


II. System Model 



Fig. 1. A topology that consists of 4 operators ($O_1$, $O_2$, $O_3$, $O_4$) with
different numbers of tasks. 


A. Data and Query Model 

As in existing MPSPEs [24], we assume that a data item
is modeled as a key-value pair. Without loss of generality, the 
key of a data item is assumed to be a string and the value is 
a blob in an arbitrary form that is opaque to the system. 

A query execution plan in MPSPEs typically consists 
of multiple operators, each being parallelized onto multiple 
processing nodes based on the key of input data. Each operator 
is assumed to be a user-defined function. We model such a
query plan as a topology of the parallel tasks of all the query
operators. By modeling each task as a vertex and the data
flow between each pair of tasks as a directed edge, the query
topology can be represented as a Directed Acyclic Graph
(DAG). Figure 1 shows an example query topology. Each task
represents the workload of an operator that is assigned to a 
processing node in the cluster and all the tasks that belong to 
the same operator will conduct the same computation. 

An operator can subscribe to the outputs from multiple 
operators except for itself. The output stream of every task 
will be partitioned into a set of substreams using a particular 
partitioning function, which divides the keys of a stream 
into multiple key partitions and splits the stream into
substreams based on these key partitions. For each task, the input
substreams received from the tasks belonging to the same
upstream neighboring operator constitute an input stream.
Therefore, the number of input streams of a task is at most the
number of its upstream neighboring operators.
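
The key-based splitting of an output stream into substreams can be sketched as follows. This is a minimal illustration, not the partitioning function of any particular engine: the names `partition_of` and `split_into_substreams`, the use of CRC32 as the hash, and the sample keys are all assumptions made for the example.

```python
import zlib
from collections import defaultdict


def partition_of(key: str, num_partitions: int) -> int:
    """Map a string key to a key-partition index with a stable hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def split_into_substreams(items, num_downstream_tasks):
    """Group (key, value) items into one substream per downstream task."""
    substreams = defaultdict(list)
    for key, value in items:
        index = partition_of(key, num_downstream_tasks)
        substreams[index].append((key, value))
    return dict(substreams)


# Hypothetical data items: a string key and an opaque blob as the value.
items = [("road-17", b"\x01"), ("road-42", b"\x02"), ("road-17", b"\x03")]
substreams = split_into_substreams(items, num_downstream_tasks=4)
```

Because the partition index depends only on the key, all items with the same key end up in the same substream, which is what allows a downstream task to keep per-key state.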

Similar to [28], we consider the following four common
partitioning situations between two neighboring operators in
an MPSPE. In the following descriptions, we consider an
upstream operator containing $N_1$ tasks and a downstream
operator containing $N_2$ tasks.

• One-to-one: each upstream task only sends data to a single
downstream task and a downstream task only receives
data from a single upstream task.

• Split: each upstream task sends data to $M_2$, $2 \le M_2 \le N_2$,
downstream tasks and each downstream task only
receives data from a single upstream task.

• Merge: each upstream task sends data to only one downstream
task and each downstream task receives data from
$M_1$, $2 \le M_1 \le N_1$, upstream tasks.

• Full: each upstream task sends data to all $N_2$ downstream
tasks.
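
The four partitioning situations can be made concrete by listing the task-to-task edges they induce. The sketch below is illustrative only: `build_edges` is a hypothetical helper, and it assumes a uniform fan-out for Split and a uniform fan-in for Merge, which the definitions above do not require.

```python
def build_edges(kind: str, n_up: int, n_down: int):
    """Return the set of (upstream_index, downstream_index) edges."""
    if kind == "one-to-one":
        if n_up != n_down:
            raise ValueError("one-to-one needs the same number of tasks on both sides")
        return {(i, i) for i in range(n_up)}
    if kind == "split":
        # Assumption: every upstream task feeds the same number of downstream tasks.
        if n_down % n_up != 0:
            raise ValueError("uniform split needs n_down to be a multiple of n_up")
        fan_out = n_down // n_up
        return {(j // fan_out, j) for j in range(n_down)}
    if kind == "merge":
        # Assumption: every downstream task reads from the same number of upstream tasks.
        if n_up % n_down != 0:
            raise ValueError("uniform merge needs n_up to be a multiple of n_down")
        fan_in = n_up // n_down
        return {(i, i // fan_in) for i in range(n_up)}
    if kind == "full":
        return {(i, j) for i in range(n_up) for j in range(n_down)}
    raise ValueError(f"unknown partitioning kind: {kind}")


split_edges = build_edges("split", n_up=2, n_down=4)
merge_edges = build_edges("merge", n_up=4, n_down=2)
full_edges = build_edges("full", n_up=2, n_down=3)
```

In the Split case every downstream index appears in exactly one edge, and in the Merge case every upstream index does, which matches the "only receives from" and "only sends to" conditions in the list above.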

B. PPA Replication Plan 

Given a topology T and its whole set of tasks $\mathcal{M}$, a
PPA replication plan for T consists of two parts: a passive
replication plan that covers all the tasks in $\mathcal{M}$ and a partially
active replication plan that covers a subset of $\mathcal{M}$, denoted
as P. With the passive replication plan, checkpoints are
periodically created for all the tasks and stored at the standby
nodes. For a task $t_i$, its checkpoint consists of $t_i$'s computation
state and output buffer. After a checkpoint is extracted from
$t_i$, its upstream neighboring tasks are notified to prune
the unnecessary data from their output buffers. The buffer
trimming should guarantee that, if $t_i$ fails, its computation state
can be recovered by loading its latest checkpoint and replaying
the output buffers of its upstream tasks. On the other hand, for
each $t_i \in P$, an active replica is created, which
receives the same input data and performs the same processing
as $t_i$'s primary copy.

Upon failures, the actively replicated tasks are recovered
immediately using their active replicas, while
the tasks that are only passively replicated are restored
from their latest checkpoints. When some failed
tasks belong to $\mathcal{M} - P$, tentative outputs will be produced
before they are fully recovered. Such tentative outputs have a
degraded quality due to the loss of input data that otherwise
would be processed by the failed tasks belonging to $\mathcal{M} - P$.
We present how to optimize the partially active replication plan 
to maximize the quality of tentative outputs and the details of 
the system implementation in the following sections. 
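
The passive part of the plan (checkpoint, upstream buffer trimming, and recovery by replay) can be illustrated with a toy example. This is a simplified sketch under stated assumptions: the classes `UpstreamBuffer` and `CountingTask`, the per-key counting state, and the sequence numbers are all hypothetical, and the checkpoint here holds only the computation state, not the task's own output buffer.

```python
import copy


class UpstreamBuffer:
    """Output buffer of an upstream task; items carry a sequence number."""

    def __init__(self):
        self.items = []      # list of (seq, key)
        self.next_seq = 0

    def emit(self, key):
        item = (self.next_seq, key)
        self.items.append(item)
        self.next_seq += 1
        return item

    def trim(self, up_to_seq):
        # Drop everything already covered by the downstream checkpoint.
        self.items = [(s, k) for s, k in self.items if s > up_to_seq]


class CountingTask:
    """Toy stateful task: counts items per key."""

    def __init__(self):
        self.counts = {}
        self.last_seq = -1

    def process(self, item):
        seq, key = item
        self.counts[key] = self.counts.get(key, 0) + 1
        self.last_seq = seq

    def checkpoint(self):
        return copy.deepcopy((self.counts, self.last_seq))

    @classmethod
    def recover(cls, checkpoint, upstream):
        task = cls()
        task.counts, task.last_seq = copy.deepcopy(checkpoint)
        # Replay whatever the upstream buffer still holds after the checkpoint.
        for item in list(upstream.items):
            if item[0] > task.last_seq:
                task.process(item)
        return task


upstream, task = UpstreamBuffer(), CountingTask()
for key in ["a", "b", "a"]:
    task.process(upstream.emit(key))
ckpt = task.checkpoint()
upstream.trim(ckpt[1])               # upstream prunes data covered by the checkpoint
for key in ["b", "c"]:
    task.process(upstream.emit(key))
recovered = CountingTask.recover(ckpt, upstream)
```

After recovery, the rebuilt state equals the state of the task just before the simulated failure, which is the guarantee the buffer trimming rule is meant to preserve.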

III. Problem Formulation 
A. Quality of Tentative Outputs 

Previous works on load shedding have studied how
to evaluate the quality of query outputs in case of loss of input
data. Their models assume full knowledge of the semantics of
individual operators and hence can estimate the output quality 
in a relatively precise way. However, in existing MPSPEs, such 
as Storm, operators are often opaque to the system and may 
contain complex user-defined functions written in imperative 
programming languages. The existing models therefore cannot 
be easily applied. In our first attempt, we have tried to derive 
output accuracy models composed by some generic functions, 
which should be chosen or provided by the users according to 
the semantics of the operators. We found that this approach is 
not very user friendly and it may be very difficult for a user 
to provide such functions for a complicated operator. 

Therefore, we strive to design a model that requires users
to provide minimal information about an operator's semantics, yet
is effective in estimating the quality of tentative outputs.
More specifically, we propose a metric, called Output Fidelity
(OF), which is roughly equal to the ratio of the source input
that can contribute to tentative outputs. This is based on the
assumption that the accuracy of tentative outputs increases with
more complete input, so that a PPA plan with a higher OF value
yields more accurate tentative outputs.

1) Operator Output Loss Model: It is the sink operator
that produces the final outputs of a topology. As task failures
can happen at any position within the topology, we need to
propagate the information losses incurred by any failed task to
the output of the sink operator. For example, if task $t_{22}$ in Figure 2
fails, we need to transform the input loss of $t_{31}$ into its output
loss. In this subsection, we propose the operator output loss
model, which estimates the information loss of an operator's
output based on the information loss of its input. In the next
subsection, we present the precise definition of OF.

In the following descriptions, the set of input streams of task
$t_i$ is denoted as $\{S^{in}_{i,1}, S^{in}_{i,2}, \ldots\}$, where the rate of $S^{in}_{i,j}$
is represented as $\lambda^{in}_{i,j}$ and its information loss is referred to
as $IL^{in}_{i,j}$. The rate of $t_i$'s output stream, $S^{out}_i$, is referred to
as $\lambda^{out}_i$, and its information loss is denoted as $IL^{out}_i$. If $t_i$
fails, its output will be lost and $IL^{out}_i$ will be set to 1.
Otherwise, we calculate $IL^{out}_i$ based on the information losses
of $t_i$'s input streams.


As described in the query model, an input stream of a task
may consist of multiple substreams, which are sourced from
tasks belonging to the same upstream neighboring operator.
Suppose that $S^{in}_{i,j}$ consists of a set of substreams $U^{in}_{i,j}$. For
each substream $s_k \in U^{in}_{i,j}$, denoting its rate as $\lambda_{s_k}$ and its
information loss as $IL_{s_k}$, the information loss of $S^{in}_{i,j}$ is
calculated as the rate-weighted average:

$$IL^{in}_{i,j} = \frac{\sum_{s_k \in U^{in}_{i,j}} \lambda_{s_k} \cdot IL_{s_k}}{\sum_{s_k \in U^{in}_{i,j}} \lambda_{s_k}} \tag{1}$$

Meanwhile, the output stream of task $t_i$, $S^{out}_i$, can be split
into a set of substreams, denoted as $U^{out}_i$. For each substream
$s_k$ belonging to $U^{out}_i$, its information loss is estimated to be
equal to $IL^{out}_i$, i.e. $IL_{s_k} = IL^{out}_i$.

Figure 2 depicts an example topology as well as the rate
of each output stream, and marks the information loss of the
output stream caused by the failure of task $t_{22}$. We
distinguish two situations and use this example to illustrate the
calculation of the information loss of a task's output stream.

Fig. 2. An illustrative topology with a task failure, where $\lambda^{in}_{31,2} = \lambda^{out}_{21} + \lambda^{out}_{22}$.
Correlated-Input Operator. $O_i$ performs computations
over the join results of its input streams. For example, suppose
$O_3$ in Figure 2 is a join operator. Without further semantic
information about $O_3$, we consider the effective input of $t_{31}$ as
the Cartesian product of its input streams, whose rate is equal
to $\lambda^{in}_{31,1} \cdot \lambda^{in}_{31,2}$ and whose information loss can be computed
as $1 - (1 - IL^{in}_{31,1}) \cdot (1 - IL^{in}_{31,2})$. By assuming that the
information loss of $t_{31}$'s output is equal to that of
its effective input, we obtain $IL^{out}_{31}$ directly from this expression. In summary, the
information loss of $t_i$'s output stream can be calculated as:

$$IL^{out}_i = 1 - \prod_{j} \left(1 - IL^{in}_{i,j}\right) \tag{2}$$


Independent-Input Operator. $O_i$ does not compute joins
over its input streams. If $O_3$ in Figure 2 is an independent-input
operator, the effective input of $t_{31}$ is considered as the union
of its input streams, whose rate is equal to $\lambda^{in}_{31,1} + \lambda^{in}_{31,2}$
and whose information loss can be calculated as
$(\lambda^{in}_{31,1} \cdot IL^{in}_{31,1} + \lambda^{in}_{31,2} \cdot IL^{in}_{31,2}) / (\lambda^{in}_{31,1} + \lambda^{in}_{31,2})$.
As with the correlated-input operator, we also assume that
the information loss of $t_{31}$'s output is equal to that of
its effective input. In general, the information loss of $t_i$'s
output stream can be calculated as follows:

$$IL^{out}_i = \frac{\sum_{j} \lambda^{in}_{i,j} \cdot IL^{in}_{i,j}}{\sum_{j} \lambda^{in}_{i,j}} \tag{3}$$

Recall that one of the design principles is to request as
little information about the operators' semantics as possible. We
distinguish the aforementioned two types of operators simply
because the characteristics of their effective inputs are very
different. With such a distinction, the OF metric can be estimated
much more precisely.
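
The loss model can be written down in a few lines. The sketch below assumes that the loss of an input stream is the rate-weighted average of its substream losses and that an independent-input operator likewise takes a rate-weighted average over its input streams, while a correlated-input operator combines losses as in Equation (2). The function names and the rates in the example are made up for illustration and are not taken from Figure 2.

```python
def stream_loss(substreams):
    """Loss of one input stream from its (rate, loss) substreams (rate-weighted)."""
    total_rate = sum(rate for rate, _ in substreams)
    return sum(rate * loss for rate, loss in substreams) / total_rate


def correlated_output_loss(input_losses):
    """Output loss of a correlated-input (join-like) task: 1 - prod(1 - IL_j)."""
    surviving = 1.0
    for loss in input_losses:
        surviving *= 1.0 - loss
    return 1.0 - surviving


def independent_output_loss(inputs):
    """Output loss of an independent-input task from (rate, loss) input streams."""
    total_rate = sum(rate for rate, _ in inputs)
    return sum(rate * loss for rate, loss in inputs) / total_rate


# Hypothetical task with two input streams:
#   stream 1: one substream of rate 1.0 from a live task (loss 0)
#   stream 2: two substreams of rate 1.0 each, one of them from a failed task (loss 1)
loss_1 = stream_loss([(1.0, 0.0)])
loss_2 = stream_loss([(1.0, 0.0), (1.0, 1.0)])

as_join = correlated_output_loss([loss_1, loss_2])
as_union = independent_output_loss([(1.0, loss_1), (2.0, loss_2)])
```

With these assumed rates, the join-style task loses half of its effective input, whereas the union-style task loses one third, which shows why the two operator types are treated separately.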

2) Output Fidelity: With the operator output loss model,
the output information losses of tasks in the sink operator
can be calculated by conducting a depth-first traversal of the
topology, which starts from the tasks in the source operators
and ends at the tasks in the sink operator.


Denote the sink operator of topology T as $O_{sink}$. The
output fidelity of topology T, $OF_T$, is defined over the tasks
$t_i$ belonging to $O_{sink}$ as:

$$OF_T = 1 - \frac{\sum_{t_i \in O_{sink}} \lambda^{out}_i \cdot IL^{out}_i}{\sum_{t_i \in O_{sink}} \lambda^{out}_i} \tag{4}$$
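
Putting the loss model and the OF definition together, the fidelity of a small topology can be computed by propagating losses from the sources to the sink. The sketch below is one possible arrangement, not the paper's implementation: it uses memoized recursion from the sink instead of an explicit traversal from the sources, it assumes rate-weighted averaging for stream losses and for independent-input tasks, and the data layout (`inputs`, `out_rate`) and the toy topology are hypothetical.

```python
from functools import lru_cache


def output_fidelity(inputs, correlated, sinks, out_rate, failed):
    """Compute OF for a topology.

    inputs:     task -> list of input streams; a stream is a list of
                (upstream_task, substream_rate) pairs
    correlated: set of tasks that belong to correlated-input operators
    sinks:      tasks of the sink operator
    out_rate:   sink task -> output rate
    failed:     set of failed tasks without an active replica
    """

    @lru_cache(maxsize=None)
    def out_loss(task):
        if task in failed:
            return 1.0
        streams = inputs.get(task, [])
        if not streams:                      # live source task
            return 0.0
        rates, losses = [], []
        for stream in streams:
            rate = sum(r for _, r in stream)
            loss = sum(r * out_loss(up) for up, r in stream) / rate
            rates.append(rate)
            losses.append(loss)
        if task in correlated:
            surviving = 1.0
            for loss in losses:
                surviving *= 1.0 - loss
            return 1.0 - surviving
        return sum(r * l for r, l in zip(rates, losses)) / sum(rates)

    total = sum(out_rate[t] for t in sinks)
    return 1.0 - sum(out_rate[t] * out_loss(t) for t in sinks) / total


# Hypothetical topology: t11 and {t21, t22} feed t31, which feeds the sink t41.
inputs = {
    "t31": [[("t11", 1.0)], [("t21", 1.0), ("t22", 1.0)]],
    "t41": [[("t31", 1.0)]],
}
of_join = output_fidelity(inputs, {"t31"}, ["t41"], {"t41": 1.0}, {"t22"})
of_union = output_fidelity(inputs, set(), ["t41"], {"t41": 1.0}, {"t22"})
```

In this toy case a failure of `t22` leaves an OF of 0.5 when `t31` is treated as a join and 2/3 when it is treated as an independent-input task.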


B. Problem Statement 

Before presenting the problem definition, we introduce the
concept of a Minimal Complete Tree, which is also referred to as
an MC-tree for simplicity in the following sections.

Definition 1. Minimal Complete Tree (MC-tree). A
minimal complete tree is a tree-structured subgraph of the
topology DAG. The source vertices of this subgraph correspond
to tasks from the source operators and its sink vertex is
a task from an output operator. A minimal complete tree can
continuously contribute to final outputs if and only if all its
tasks are alive.
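
MC-trees can be enumerated recursively from a sink task. The sketch below assumes that a task of an independent-input operator needs only one upstream task to keep producing output, whereas a task of a correlated-input operator needs one upstream task from each of its input streams. The function `mc_trees`, the `inputs` layout, and the toy topology are hypothetical and do not reproduce Figure 1.

```python
from itertools import product


def mc_trees(task, inputs, correlated):
    """Return the MC-trees ending at `task` as a set of frozensets of task names.

    inputs:     task -> list of input streams, each a list of upstream task names
    correlated: set of tasks that belong to correlated-input operators
    """
    streams = inputs.get(task, [])
    if not streams:                          # source task: the tree is the task itself
        return {frozenset([task])}
    per_stream = [
        {tree for upstream in stream for tree in mc_trees(upstream, inputs, correlated)}
        for stream in streams
    ]
    if task in correlated:
        # A join needs one subtree from every input stream.
        return {frozenset([task]).union(*combo) for combo in product(*per_stream)}
    # Otherwise a single subtree from any input stream is enough.
    return {tree | {task} for subtrees in per_stream for tree in subtrees}


# Hypothetical topology: t11 and {t21, t22} feed t31, which feeds the sink t41.
inputs = {"t31": [["t11"], ["t21", "t22"]], "t41": [["t31"]]}
independent_trees = mc_trees("t41", inputs, correlated=set())
correlated_trees = mc_trees("t41", inputs, correlated={"t31"})
```

For this toy topology the independent-input case yields three MC-trees (one per source task) and the correlated-input case yields two, each containing `t11` together with one of `t21` or `t22`.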


Taking the topology in Figure 1 as an example, if $O_3$ is an
independent-input operator, the tasks in $\{t_{11}, t_{31}, t_{41}\}$
constitute an MC-tree and there are in total 16 MC-trees in the
topology. However, if $O_3$ is a correlated-input operator, $t_{31}$
cannot produce any output if either $t_{11}$ or $t_{21}$ fails. Hence the
tasks in $\{t_{11}, t_{21}, t_{31}, t_{41}\}$ constitute an MC-tree and the
number of MC-trees in the topology is equal to 8.

Based on Definition 1, if tasks in an MC-tree fail, the tree
will continue propagating data to the sink
operator if and only if all of its failed tasks are actively
replicated. Suppose topology T consists of a set of operators
$O_1, O_2, \ldots, O_N$ and the available resources can be used to
actively replicate R tasks ($R \le |\mathcal{M}|$, where $\mathcal{M}$ is the set of all the
tasks of T). Then the problem of optimizing a partially active
replication plan is defined as follows:

Definition 2. Partially Active Plan. Given a query
topology T, choose R tasks for active replication such that
the output fidelity of the partial topology that is composed of
the actively replicated MC-trees in T is maximized.

This problem is NP-hard, as the Set-Union Knapsack Problem [8],
which is NP-hard, can be reduced to it in polynomial time.

IV. Active Replication Optimization

Recall that we consider the worst-case scenario for a
correlated failure, i.e. there is at least one failed task in
every MC-tree. Before the completion of the passive recovery
process, only the MC-trees whose failed tasks are all actively
replicated can produce tentative outputs. The optimization
objective is to maximize the value of OF with a limited amount
of resources used for active replication.

A. Dynamic Programming

Algorithm 1: Dynamic Programming: PlanCorrelatedFailure(R)

Input: Amount of available resources R;
Output: Replication plan P;

1   CP_0 ← ∅; usage ← 0; SC ← {CP_0};
    /* CP_0: initial replication plan; SC: candidate plan set; */
2   while usage++ < R do
3       foreach candidate plan CP_i in SC do
4           dif ← usage − |CP_i|;
            /* |CP_i| is the number of replicated tasks in CP_i and dif is
               the number of tasks that can be added to CP_i at this step; */
5           UT_i ← { MC-tree tr | tr ⊄ CP_i };
6           U_i ← max{ nonrep_tasks(tr, CP_i) | tr ∈ UT_i };
            /* nonrep_tasks(tr, CP_i) returns the number of non-replicated
               tasks of MC-tree tr in CP_i; */
7           if dif ≤ U_i then
8               foreach MC-tree tr_j ∈ { tr | tr ⊄ CP_i & nonrep_tasks(tr, CP_i) == dif } do
9                   CP_j ← CP_i ∪ tr_j;
10                  if CP_j ∉ SC then
11                      Add CP_j to SC;
12          else Remove CP_i from SC;
13  P ← the candidate plan in SC with the maximal OF value. Return P;

We first present a dynamic programming algorithm that
can generate an optimal replication plan for correlated failure.
As introduced in Section III-B, we take the MC-tree
as the basic unit for replication candidates in the algorithm.
Details of this algorithm are presented in Algorithm 1. It is
essentially a bottom-up dynamic programming algorithm. We
incrementally increase the amount of resources to be used for
active replication and enumerate the possible expansions of the
plans produced in the previous step. Assuming the minimum
size of MC-trees is r, one can obtain the first set of replication
plans, referred to as SC, by replicating r tasks. At this step,
each plan in SC contains exactly one MC-tree. Note that the
MC-trees that have not been added to a candidate plan $CP_i$
may also have replicated tasks if they share some tasks with
another MC-tree within $CP_i$.

At each iteration of the while loop starting at line
2, we increase the resource usage by 1. We scan through
each candidate plan $CP_i \in SC$ to see if there is an MC-tree
$tr_j \not\subset CP_i$ whose number of non-replicated tasks
is equal to $usage - |CP_i|$, where $|CP_i|$ is the number
of replicated tasks in $CP_i$. For each MC-tree satisfying this
condition, we create a new candidate plan $CP_j$ (line 9) such
that $CP_j \leftarrow CP_i \cup tr_j$. If $CP_j$ has no duplicate in SC, then
it will be inserted into SC. The algorithm continues until
usage is equal to the limit R.

The cost of scanning through SC can be reduced by
removing a candidate plan $CP_i$ from SC once all its possible
expansions have been considered. More precisely, we remove $CP_i$
from SC if the maximum number of non-replicated tasks of
the MC-trees not included in $CP_i$ is less than the difference
between the available resource at the current iteration, i.e.
usage, and the current number of replicated tasks in $CP_i$ (lines
7 and 12). After the while loop is finished, the candidate plan
with the maximal OF in SC will be returned.

Algorithm 2: Greedy(R)

Input: The amount of available resources R;
Output: Replication plan P

1   Initialize: AS ← ∅;
2   foreach task t_i ∉ P do
3       A_i ← the value of OF if only t_i fails;
4       AS ← AS ∪ {A_i};
5   Sort AS in ascending order;
6   TS ← set of tasks whose corresponding OF values are among the
    top-R in AS (i.e. the R smallest);
7   P ← P ∪ TS;
8   Return P

The upper bound of the complexity of this algorithm is
$O(2^T)$, where T is the number of MC-trees in the query
topology, which varies with the topology structure and has an
upper bound of $O(M^N)$, where N is the number of operators
and M is the average degree of parallelization of operators
in the topology. The following theorem states the optimality of this
dynamic programming algorithm; the proof is omitted due to
space limitations.

Theorem 1. Let P be the replication plan produced by
Algorithm 1 and $P_t$ be a different replication plan. If $OF_{P_t} >
OF_P$, then the resource usage of P is always equal to or less
than that of $P_t$.
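
To make the search space concrete, the sketch below picks a plan by exhaustively trying every subset of MC-trees whose task union fits in the budget. It is a brute-force baseline with the same exponential worst case in the number of MC-trees, not Algorithm 1 itself. As a stand-in for OF it assumes that each fully replicated MC-tree contributes a fixed weight, which is a simplification made only for this example; the function name, weights, and trees are hypothetical.

```python
from itertools import combinations


def best_plan(trees, weights, budget):
    """Exhaustively choose MC-trees whose union of tasks has at most `budget` tasks.

    trees:   list of frozensets of task names (the MC-trees)
    weights: assumed contribution of each MC-tree to the objective
    Returns (replicated_tasks, score).
    """
    best_tasks, best_score = frozenset(), 0.0
    for size in range(1, len(trees) + 1):
        for chosen in combinations(range(len(trees)), size):
            tasks = frozenset().union(*(trees[i] for i in chosen))
            if len(tasks) > budget:
                continue
            # A tree counts as soon as all of its tasks are replicated,
            # even if it was not chosen explicitly.
            score = sum(w for tree, w in zip(trees, weights) if tree <= tasks)
            if score > best_score:
                best_tasks, best_score = tasks, score
    return best_tasks, best_score


# Hypothetical MC-trees that share the tasks t31 and t41.
trees = [
    frozenset({"t11", "t31", "t41"}),
    frozenset({"t21", "t31", "t41"}),
    frozenset({"t22", "t31", "t41"}),
]
weights = [1 / 3, 1 / 3, 1 / 3]
plan_tasks, plan_score = best_plan(trees, weights, budget=4)
```

With a budget of four tasks, two of the three trees can be covered because they share `t31` and `t41`. This sharing is what makes the problem resemble a set-union knapsack rather than a plain knapsack.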

B. Greedy Algorithm 

We next present a greedy algorithm. For each task in the
topology, the greedy algorithm calculates the OF of the
topology when only this task fails. A task whose failure would
lead to a smaller OF is assigned a higher priority for
replication. The details of this greedy algorithm are given in
Algorithm 2, which first ranks all the tasks in ascending
order based on the OF that results from their respective failures.
It then repeatedly chooses the task that would
cause the minimal OF among all the remaining non-replicated
tasks in the set AS, until R tasks have been selected.

The complexity of the greedy algorithm is $O(N \cdot M)$,
where the notations are defined in Section IV-A. Although
this complexity is much lower than that of the dynamic
programming algorithm, the greedy algorithm does not consider whether the tasks
in the replication plan form complete MC-trees, which
hurts its performance, especially when the number of
actively replicated tasks is small. The experimental results in
Section VI-B confirm this defect of the greedy algorithm.
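
The greedy ranking can be sketched in a few lines. The function `greedy_plan` and the OF values in the example are hypothetical; in practice the single-failure OF of each task would be computed with the output loss model.

```python
def greedy_plan(of_if_only_fails, budget):
    """Pick the `budget` tasks whose individual failure hurts OF the most.

    of_if_only_fails: task -> OF of the topology when only that task fails
    """
    ranked = sorted(of_if_only_fails, key=of_if_only_fails.get)  # ascending OF
    return set(ranked[:budget])


# Made-up single-failure OF values for a five-task topology.
single_failure_of = {
    "t11": 0.50,
    "t21": 0.75,
    "t22": 0.75,
    "t31": 0.00,
    "t41": 0.00,
}
plan = greedy_plan(single_failure_of, budget=2)
```

Here the two tasks on every path, `t31` and `t41`, are chosen first. With a budget of two, however, no complete MC-tree is replicated, which illustrates the weakness discussed above.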

C. Structure-Aware Algorithm 

The dynamic programming algorithm searches for the 
optimal plan by selecting a subset of MC-trees for replication 
under the resource constraint to maximize the value of OF. 


Inspired by this, we design a structure-aware algorithm that, at 
each step, rather than enumerating all the possible expansions 
of a candidate plan, only expands it with an MC-tree that can 
incur the greatest increase in OF per resource unit. 

Unfortunately, even such a greedy approach may fall short
in the following situation. Consider a topology T that
consists of a sequence of k operators, all of which use
Full partitioning. The number of MC-trees within T is equal to
$\prod_{i=1}^{k} M_i$, where $M_i$ is the number of tasks of operator $O_i$. In
such a topology, the number of MC-trees grows very fast
with an increasing number of operators. Therefore, even a greedy
search among the possible combinations of MC-trees would
not perform well.
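
A quick calculation shows how fast this product grows. The helper name and the parallelism values below are arbitrary examples.

```python
import math


def full_chain_mc_tree_count(parallelism):
    """Number of MC-trees in a chain of operators that all use Full partitioning.

    parallelism: list with the number of tasks M_i of each operator O_i
    """
    return math.prod(parallelism)


small = full_chain_mc_tree_count([4, 4, 4])          # 3 operators, 4 tasks each
larger = full_chain_mc_tree_count([10] * 6)          # 6 operators, 10 tasks each
```

Three operators with four tasks each already give 64 MC-trees, and six operators with ten tasks each give one million, although the latter topology contains only 60 tasks.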

To solve this problem, we first decompose a general
topology into two types of topologies, namely full topologies
and structured topologies, and then optimize them separately.
The definitions of these two types of topologies are as follows:

• Structured topology is defined as a topology in which only
the operators that produce the outputs of this topology can
have a Full partitioning function, while the others have other
types of partitioning functions.

• Full topology is defined as a topology in which all
operators have a Full partitioning function.

The rest of this section is organized as follows: firstly, we 
present the algorithms generating PPA plans for structured 
topologies and full topologies respectively. Then we will 
explain the structure-aware algorithm, which generates the PPA 
plan for a general topology by decomposing it into several 
sub-topologies, each being either a structured topology or a 
full topology. 

1) Algorithm for Structured Topology: Although we define
structured topology such that Full partitioning only exists in
the output operators, the number of MC-trees in a structured
topology could still be very large. If
a task $t_i$ receives $N_{in}$ input streams and produces $N_{out}$ output
streams, there will be at least $N_{in} \cdot N_{out}$ MC-trees containing
$t_i$. In addition, if $t_i$ joins $N_k$ substreams from operator $O_k$
with $N_j$ substreams from operator $O_j$, the number of MC-trees
containing $t_i$ will be at least $N_k \cdot N_j$. To avoid
bad performance due to the large number of MC-trees, we
split a structured topology into multiple units such that, within
a unit, the number of MC-trees is equal to the maximal number
of input substreams among the operators of this unit. We refer
to an MC-tree in a unit as a segment to differentiate it from the
concept of a complete MC-tree in the topology.



Fig. 3. Examples of splitting structured topologies into units, shown in
panels (a) and (b). $O_3$ in Figure 3(b) is a join operator.

The situation of multiple input streams and multiple output
streams occurs on a task that has an input stream partitioned
with Merge and an output stream partitioned with Split. In this
case, a unit boundary is set between this operator and its upstream
neighboring operator that uses Merge partitioning. For instance,
a unit boundary is set between $O_1$ and $O_2$ in the topology
in Figure 3(a). The situation in which a task joins multiple input
substreams from one operator with substreams from other
operators happens on the tasks of join operators that have at
least one input stream partitioned with Merge. As illustrated
in Figure 3(b), a unit boundary is set between $O_1$ and $O_3$.

Algorithm 3: PlanStructuredTopology(P, R, T)

Input: An initial plan P; the amount of available resources R; topology T;
Output: Replication plan P;

1   usage = 0; S_U ← set of the units split from topology T;
2   foreach unit U_i ∈ S_U do
3       Build segment set G_i;
4   while usage < R do
5       Candidates ← ∅;
6       foreach unit U_i ∈ S_U do
7           foreach non-replicated segment g_i ∈ U_i do
8               CG_i ← {g_i};
9               if OF_P == OF_{P ∪ CG_i} then
10                  Conduct a BFS from U_i to traverse all the units:
11                  foreach visited unit U_j during the BFS do
12                      Segment g_j ← max_of(U_j);
                        /* max_of(U_j) returns the segment in U_j which is
                           connected with a segment in CG_i and has the
                           maximal OF with U_j treated as an independent
                           topology; */
13                      if |CG_i| + |g_j| < usage then
14                          CG_i = CG_i ∪ g_j;
15                      else Stop the BFS;
16              Candidates ← Candidates ∪ CG_i;
17      Find CG_opt from Candidates such that the following value is
        maximized: (OF_{P ∪ CG_opt} − OF_P) / |CG_opt|;
18      P = P ∪ CG_opt; usage = usage + |CG_opt|;
19      if CG_opt ≠ ∅ then return P;
20      Remove the completely replicated units from S_U;
21  Return P;

Note that, with such a decomposed topology, replicating a 
segment is beneficial only if all the other segments within the 
same complete MC-tree are also replicated. In other words, 
we should avoid enumerating plans that replicate a set of 
disconnected segments. 

The details of the algorithm for structured topology are
presented in Algorithm 3. The algorithm searches through
the units generated from the input topology. Within unit $U_i$, if
the set of non-replicated segments is not empty, we check
whether replicating each of these segments will increase the final
output accuracy (line 9). Note that this will only be true if the
segment can form a complete MC-tree with the other replicated
segments within the current plan. Each such segment is
put into the candidate pool (line 16). If the segment $g_i$ does
not enhance the plan's OF, we conduct a BFS (breadth-first
search) starting from $U_i$ and traversing through the units in
topology T. The BFS terminates once usage is less than the number of
non-replicated tasks in $CG_i$. In the end, every unit visited during
the BFS contributes a segment to $CG_i$ and the segments from
neighboring units are connected (lines 10-15). Then we put
such a set of segments as one candidate in the candidate pool.


Algorithm 4: PlanFullTopology(P, R, T)

Input: Initial replication plan P; amount of available resources R; topology T;
Output: Replication plan P

1   Initialize: usage ← 0;
2   N ← number of operators;
3   Sort the set of tasks S_i of each operator O_i based on the OF
    increase, δ_ij, of tasks;
4   if P = ∅ & N ≤ R then
5       foreach O_i do
6           Let p_ik be the node in S_i that has the largest OF increase δ_ik;
7           P ← P ∪ {p_ik}; S_i ← S_i − {p_ik};
8       usage = N;
9   if P = ∅ & N > R then return P;
10  while usage < R do
11      Candidates ← ∅;
12      foreach O_i do
13          Let p_ik be the node in S_i that has the largest OF increase δ_ik;
14          P_i ← P ∪ {p_ik}; Candidates ← Candidates ∪ {P_i};
15      P_j ← max_accuracy_plan(Candidates);
16      S_j ← S_j − {p_jk}; P ← P_j; usage++;
17  Return P;


After scanning all units, we obtain a candidate pool
consisting of a number of segment sets, each containing
one or more segments. We use a profit density function to
rank the candidates. The profit density of a candidate $CG_k$ is
calculated as $(OF_{P \cup CG_k} - OF_P)/|CG_k|$, where $OF_P$ is
the OF value of plan $P$, $OF_{P \cup CG_k}$ is the OF value after
expanding $P$ by replicating the segments in $CG_k$, and $|CG_k|$
is the number of non-replicated tasks within $CG_k$. The
candidate in the pool with the maximum profit density is merged
with the input plan $P$ and returned. The complexity of
Algorithm 3 is $O(R \cdot N \cdot \ldots \cdot E)$, where $R$ is the
amount of available replication resources, $N$ is the number of
operators, $M$ represents the average degree of parallelization of
operators in $T$, and $E$ is the number of neighboring unit pairs.
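
The ranking step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a plan is represented as a set of task ids, the OF metric is passed in as a caller-supplied function `of`, and the names `profit_density`, `best_candidate` and `toy_of` are hypothetical rather than taken from the PPA implementation.

```python
from typing import Callable, FrozenSet, Iterable, Optional

Plan = FrozenSet[str]  # hypothetical representation: the set of replicated task ids


def profit_density(plan: Plan, candidate: Plan, of: Callable[[Plan], float]) -> float:
    """Return (OF(P u CG) - OF(P)) / |CG|, counting only tasks not yet in P."""
    new_tasks = candidate - plan
    if not new_tasks:
        return 0.0  # nothing new to replicate, so no gain per task
    return (of(plan | candidate) - of(plan)) / len(new_tasks)


def best_candidate(plan: Plan, candidates: Iterable[Plan],
                   of: Callable[[Plan], float]) -> Optional[Plan]:
    """Pick the candidate with the maximum profit density (None for an empty pool)."""
    return max(candidates, key=lambda cg: profit_density(plan, cg, of), default=None)


def toy_of(plan: Plan) -> float:
    """Toy OF (an assumption for illustration): a group counts only once complete."""
    score = 0.0
    if {"a", "b"} <= plan:
        score += 0.8
    if "c" in plan:
        score += 0.3
    return score


pool = [frozenset({"a", "b"}), frozenset({"c"}), frozenset({"a"})]
chosen = best_candidate(frozenset(), pool, toy_of)
```

In this toy pool the incomplete candidate {a} contributes nothing on its own, which mirrors the observation that only segments forming a complete MC-tree increase OF.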

2) Algorithm for Full Topology: Each task within a full
topology sends data to all the tasks that belong to its
downstream neighboring operators. We propose an algorithm
for full topologies, as illustrated in Algorithm 4. The basic idea
of this algorithm is that, within any operator, we always prefer
to replicate the task that brings the maximum increase of
OF under the assumption that all the other tasks that belong
to the same operator have failed and the tasks that belong to
other operators are alive. We denote the increase of OF by
replicating task $t_{ij}$ as $\delta_{ij}$. If the input plan $P$ is
empty, we first select from each operator the task that has the
largest $\delta_{ij}$ among all the tasks in this operator and put
it into $P$ (lines 4-7). If $P$ is not empty, we iterate and select
$R$ tasks that have larger OF increases, i.e., $\delta_{ik}$, than
the other tasks in the topology and put them into $P$ (lines
10-16). The complexity of this algorithm is $O(N \cdot R)$, where
$R$ is the amount of available replication resources and $N$ is
the number of operators.
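
A minimal Python sketch of this greedy procedure follows. It is not the authors' implementation: the OF metric is a caller-supplied function `of`, the per-task OF increases are given in a dictionary `delta`, and the function name, argument names and the additive toy metric are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Set


def plan_full_topology(plan: Set[str], budget: int,
                       tasks_by_operator: Dict[str, List[str]],
                       delta: Dict[str, float],
                       of: Callable[[Set[str]], float]) -> Set[str]:
    """Greedy plan for a full topology; `budget` counts tasks added by this call."""
    plan = set(plan)
    # Per operator, the not-yet-replicated tasks sorted by decreasing OF increase.
    remaining = {
        op: sorted((t for t in tasks if t not in plan),
                   key=lambda t: delta[t], reverse=True)
        for op, tasks in tasks_by_operator.items()
    }
    usage = 0
    if not plan:
        if len(tasks_by_operator) > budget:
            return plan  # cannot cover every operator, so no useful plan exists
        for tasks in remaining.values():
            if tasks:
                plan.add(tasks.pop(0))  # seed with the best task of each operator
                usage += 1
    while usage < budget:
        # One candidate per operator: its best remaining task.
        candidates = [(op, tasks[0]) for op, tasks in remaining.items() if tasks]
        if not candidates:
            break  # every task is already replicated
        op, task = max(candidates, key=lambda c: of(plan | {c[1]}))
        remaining[op].pop(0)
        plan.add(task)
        usage += 1
    return plan


# Toy inputs (assumptions): two operators and an additive stand-in for OF.
ops = {"O1": ["a1", "a2"], "O2": ["b1", "b2"]}
gain = {"a1": 0.5, "a2": 0.3, "b1": 0.4, "b2": 0.1}


def toy_of(p: Set[str]) -> float:
    return sum(gain[t] for t in p)


full_plan = plan_full_topology(set(), 3, ops, gain, toy_of)
too_small = plan_full_topology(set(), 1, ops, gain, toy_of)
extended = plan_full_topology({"a1"}, 1, ops, gain, toy_of)
```

With a budget smaller than the number of operators the sketch returns an empty plan, matching the early return in Algorithm 4.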

3) Solution for General Topology: With the above algorithms
for specific topology structures, we divide a general
topology into several sub-topologies and then use the
corresponding algorithm, according to the type of each
sub-topology, to generate the replication plans. We require that
at least one partitioning function between any two neighboring
sub-topologies is Full and that the number of sub-topologies is
minimized. This requirement makes the selection of the
replication segments in different sub-topologies
independent of each other.

Algorithm 5: StructureAware($R$, $T$)

Input: The amount of available resources $R$; topology $T$;
Output: Partial replication plan $P$;

 1 Initialize: decompose the complete topology $T$ into
   sub-topologies: $TS_1, TS_2, \ldots$;
 2 $P \leftarrow \emptyset$, $S_\Delta \leftarrow \emptyset$, usage $\leftarrow 0$;
 3 if $R <$ number of operators in $T$ then
 4    Return $P$;
 5 foreach sub-topology $TS_i$ do
 6    $N_i \leftarrow$ number of operators in $TS_i$;
 7    $P_i \leftarrow$ PlanSubTopology($\emptyset$, $R_i$, $TS_i$); $P \leftarrow P + P_i$;
 8    $P'_i \leftarrow$ PlanSubTopology($P_i$, $R_i$, $TS_i$);
 9    $C_i \leftarrow |P'_i| - |P_i|$; $\Delta_i \leftarrow (OF_{P'_i} - OF_{P_i}) / C_i$;
10    Put $\Delta_i$ into $S_\Delta$ in descending order; usage += $N_i$;
11 while usage $< R$ do
12    lastUsage $\leftarrow$ usage; $j \leftarrow 1$;
13    while $j < |S_\Delta|$ do
14       $\Delta_i \leftarrow$ $j$th value in $S_\Delta$; $j$++;
15       if $C_i +$ usage $< R$ then
16          Use $P'_i$ to replace $P_i$ in $P$;
17          Calculate new $C_i$, $\Delta_i$. Insert $\Delta_i$ into $S_\Delta$ in
            descending order; break;
18    if usage = lastUsage then break;
19 Return $P$;

Function: PlanSubTopology($P$, $N_i$, $T$)
20 if $T$ is a full topology then
21    $P \leftarrow$ PlanFullTopology($P$, $N_i$, $T$);
22 else $P \leftarrow$ PlanStructuredTopology($P$, $N_i$, $T$);



Fig. 4. Example of splitting a topology into sub-topologies.

The split algorithm explores the topology using multiple
depth-first searches (DFS). At the beginning, only the sink
operator of the given topology is in the start point set $SP$.
At each iteration, we pick an operator, $O_s$, from $SP$ and
build a sub-topology by performing a DFS starting from $O_s$. If
the DFS arrives at an operator $O_i$ whose partitioning function
is incompatible with the type of the current sub-topology, it
will not traverse $O_i$'s downstream operators any further, and
$O_i$ will not be added to the current sub-topology but will
instead be put into $SP$. The algorithm terminates when
$SP$ is empty. Figure 4 presents an example general topology,
which is decomposed into two sub-topologies: $\{O_1, O_2, O_3\}$
and $\{O_4, O_5, O_6\}$.
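
The split procedure can be sketched in Python as below. This is a simplified reading rather than the paper's exact algorithm: each operator is labelled with a single partitioning kind ("full" or "structured"), the type of a sub-topology is taken from its start operator, and the map `next_ops`, which tells the DFS where to go next from an operator, is supplied by the caller. The function name and the six-operator example are hypothetical, and the sketch does not enforce the minimality requirement stated earlier.

```python
from typing import Dict, List, Set, Tuple


def split_topology(sink: str, next_ops: Dict[str, List[str]],
                   kind: Dict[str, str]) -> List[Tuple[str, List[str]]]:
    """Split a topology into sub-topologies of a single kind using repeated DFS."""
    start_points = [sink]          # the start point set SP
    assigned: Set[str] = set()     # operators already placed in a sub-topology
    subs: List[Tuple[str, List[str]]] = []
    while start_points:
        start = start_points.pop()
        if start in assigned:
            continue
        sub_kind = kind[start]     # assumption: the start operator fixes the type
        members: List[str] = []
        stack = [start]
        while stack:
            op = stack.pop()
            if op in assigned:
                continue
            if kind[op] != sub_kind:
                # Incompatible operator: stop here and start a new DFS from it later.
                start_points.append(op)
                continue
            assigned.add(op)
            members.append(op)
            stack.extend(next_ops.get(op, []))
        subs.append((sub_kind, members))
    return subs


# Hypothetical six-operator topology; edges point from the sink toward the sources.
edges = {"O6": ["O5"], "O5": ["O4"], "O4": ["O3"], "O3": ["O2", "O1"]}
kinds = {"O1": "structured", "O2": "structured", "O3": "structured",
         "O4": "full", "O5": "full", "O6": "full"}
parts = split_topology("O6", edges, kinds)
```

On this example the sketch yields one full sub-topology {O4, O5, O6} and one structured sub-topology {O1, O2, O3}.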

We present details of the correlated-failure optimization
algorithm for a general topology in Algorithm 5, which is
referred to as the Structure-Aware algorithm. The algorithm
first decomposes the topology into sub-topologies, which are
either full topologies or structured topologies. Then the
algorithm runs in multiple iterations. Within each iteration, it
tries to get a replication plan from each sub-topology and
selects the one with the maximum profit density (lines 11-17).
The loop terminates when there are no more resources to
replicate a complete MC-tree. The algorithm's complexity is
$O(R \cdot N \cdot \ldots \cdot E)$, where the notations are defined
in Section IV-C1.

V. System Implementation 
A. Framework 



Fig. 5. System Framework 

We implemented our system on top of Storm. Compared
with Spark Streaming, which processes data in a micro-batching
approach, Storm processes an input tuple as soon as it arrives
and thus can achieve sub-second end-to-end processing latency.
As shown in Figure 5, the nimbus in the Storm master node
assigns tasks to the Storm worker nodes and monitors
failures. On receiving a job, the nimbus transfers the query
topology to the PPA plan manager, which generates a
PPA recovery plan under the constraint on the resource usage
of active replication. The PPA recovery plan consists of two
parts: a completely passive standby plan and a partially active
replication plan. Based on the PPA recovery plan, the
replication manager in the worker nodes creates checkpoints to
passively replicate the whole query topology. Checkpoints are
stored on a set of standby nodes. The replication manager
also creates active replicas for the tasks that are included in the
partially active replication plan. The active replicas support
fast failure recovery and are also deployed on the standby
nodes.

Once a failure is detected by the nimbus, the recovery
manager in the Storm master node decides how to recover
the failed tasks based on the PPA replication plan. For the
tasks that are actively replicated, the recovery manager
notifies the nimbus to recover them using their active replicas
so that tentative results can be produced as soon as
possible. The failed tasks that are passively replicated are
recovered from their latest checkpoints.

B. PPA Fault Tolerance 

Passive Replication. In PPA, checkpoints of the processing
tasks are periodically created and stored at the standby
nodes. We adopted the batch processing approach [?] to
guarantee that the processing order of inputs during recovery
is identical to that before the failure. With this approach, input
tuples are divided into a consecutive set of batches. A task
starts processing a batch only after it has received all the input
tuples belonging to the current batch. This is ensured by waiting
for a batch-over punctuation from each of its upstream
neighboring tasks. Tuples within a batch are processed in a
predefined round-robin order. The effect of batch size on
system performance has been studied in previous work [?].

A single point failure can be recovered by restarting the 
failed task, loading its latest checkpoint and replaying its up¬ 
stream tasks’ buffered data. The downstream tasks will skip the 
duplicated output from the recovering task until the end of the 
recovery phase. When recovering from a correlated failure, if a
task and its upstream neighboring task fail simultaneously
and its checkpoint was made later than its upstream peer's, the
recovery of the downstream task can only start after its
upstream peer has caught up with the processing progress. In
other words, synchronization has to be carried out among
the neighboring tasks.

Active Replication. If task t has an active replica t', the
output buffer of t' stores the output tuples produced by
processing the same input in the same sequence as t does. The
downstream tasks of t subscribe to the outputs of both t
and t'. By default, the output of t' is turned off. To reduce the
buffer size on t', its primary, t, periodically notifies t' of
its latest output progress, and t' can then trim its output
buffer. If t fails, t' starts sending data to the downstream
tasks of t. The downstream tasks eliminate the duplicated
tuples from t' by recognizing their sequence numbers. The
batch processing strategy guarantees an identical processing
order between the primary and the active replica of a task.
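
The buffering and de-duplication logic described above can be illustrated with a short Python sketch. The class names `ReplicaOutputBuffer` and `DownstreamDeduplicator` are hypothetical, and the sketch assumes that output tuples carry monotonically increasing sequence numbers; it is not taken from the PPA code base.

```python
from collections import deque
from typing import Any, Deque, List, Tuple


class ReplicaOutputBuffer:
    """Output buffer of an active replica t'; its output is off until take-over."""

    def __init__(self) -> None:
        self._buffer: Deque[Tuple[int, Any]] = deque()  # (sequence number, tuple)
        self.sending = False

    def append(self, seq: int, item: Any) -> None:
        self._buffer.append((seq, item))

    def trim(self, primary_progress: int) -> None:
        """Drop tuples the primary reports as already output (seq <= progress)."""
        while self._buffer and self._buffer[0][0] <= primary_progress:
            self._buffer.popleft()

    def take_over(self) -> List[Tuple[int, Any]]:
        """Called when the primary fails: enable output and flush the buffer."""
        self.sending = True
        return list(self._buffer)


class DownstreamDeduplicator:
    """Downstream side: accept each sequence number at most once."""

    def __init__(self) -> None:
        self.last_seq = -1

    def accept(self, seq: int) -> bool:
        if seq <= self.last_seq:
            return False  # duplicate of a tuple already received from the primary
        self.last_seq = seq
        return True


replica = ReplicaOutputBuffer()
for s in range(1, 6):
    replica.append(s, f"tuple-{s}")
replica.trim(3)                  # primary reported progress up to sequence number 3
flushed = replica.take_over()    # primary failed after sending tuple 4 downstream

downstream = DownstreamDeduplicator()
downstream.last_seq = 4
accepted = [seq for seq, _ in flushed if downstream.accept(seq)]
```

A longer interval between progress notifications leaves more tuples in the buffer at take-over time, so more duplicates have to be filtered downstream.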

Tentative Outputs. Checkpoint-based recovery requires
replaying the buffered data and synchronizing the connected
tasks, and hence incurs significant recovery latency. PPA
therefore has the option to continue producing tentative results
once the actively replicated tasks are recovered. Recall that
during normal processing, a task will only start processing a 
batch after receiving the batch-over punctuations from all of its 
upstream neighboring tasks. If any of its upstream neighboring 
tasks fails, the recovery manager in the Storm master node 
will generate the necessary batch-over punctuations for those 
failed tasks, such that a batch could be processed without 
the inputs from the failed tasks and tentative outputs will be 
generated with an incomplete batch. After the failed tasks are 
recovered, the recovery manager will stop sending the batch- 
over messages for them such that the downstream tasks will 
wait for the batch contents from the recovered tasks before 
processing a batch. After all the failed tasks are recovered, the 
topology will start generating accurate outputs. 
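
The punctuation mechanism can be sketched as a small barrier object in Python. The class `BatchBarrier` and its methods are hypothetical names introduced for illustration; the sketch assumes that a punctuation generated by the recovery manager on behalf of a failed task is flagged as synthetic.

```python
from typing import Iterable, Set


class BatchBarrier:
    """Tracks batch-over punctuations from the upstream tasks of one task."""

    def __init__(self, upstream_ids: Iterable[str]) -> None:
        self.upstream: Set[str] = set(upstream_ids)
        self.punctuated: Set[str] = set()
        self.synthetic: Set[str] = set()

    def on_punctuation(self, upstream_id: str, synthetic: bool = False) -> None:
        """Record a punctuation; synthetic ones stand in for failed upstream tasks."""
        if upstream_id not in self.upstream:
            return  # ignore punctuations from unknown senders
        self.punctuated.add(upstream_id)
        if synthetic:
            self.synthetic.add(upstream_id)

    def ready(self) -> bool:
        """A batch can be processed once every upstream task is punctuated."""
        return self.punctuated == self.upstream

    def close_batch(self) -> bool:
        """Finish the batch; return True if its output is tentative."""
        if not self.ready():
            raise RuntimeError("batch is still waiting for punctuations")
        tentative = bool(self.synthetic)
        self.punctuated.clear()
        self.synthetic.clear()
        return tentative


barrier = BatchBarrier(["u1", "u2"])
barrier.on_punctuation("u1")
waiting = barrier.ready()                       # still waiting for u2
barrier.on_punctuation("u2", synthetic=True)    # u2 failed; punctuation is generated
first_batch_tentative = barrier.close_batch()

barrier.on_punctuation("u1")
barrier.on_punctuation("u2")                    # u2 has recovered
second_batch_tentative = barrier.close_batch()
```

Once the recovery manager stops sending synthetic punctuations for a recovered task, batches close only on real punctuations and the outputs are accurate again.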

In this paper, we assume the adoption of techniques similar
to those proposed in [?] to reconcile the computation state and
correct the tentative outputs, and leave the implementation of
these techniques as future work.

C. Dynamic Plan Adaptation 

Considering that tasks' input rates may fluctuate over time,
the active replication plan should be dynamically adapted
accordingly. The PPA plan manager periodically collects the
input rates of all the processing tasks and generates a new
active replication plan. If the new plan differs from the
previously applied plan, applying it may require deactivating
the active replicas of one set of tasks and generating active
replicas for another set of tasks. Deactivating the active replicas
can be implemented by terminating their processing and releas¬ 
ing their occupied resources. To generate new active replicas, 
we can send the corresponding checkpoints to the destination 
nodes and initialize the state of the active replicas by using 
the checkpoints. The newly started active replicas will receive 
the buffered outputs from their upstream neighboring tasks 
and then start the processing. Eventually, the newly generated 
active replicas will catch up with the progress of their primary 
copies. Dynamic plan adaptation is not implemented in the 
current version of our system, which is part of our future work. 
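
As a minimal illustration of the adaptation step, the two sets of tasks affected by a plan change are plain set differences. The function name `plan_delta` and the task ids are hypothetical; since the paper states that adaptation is not yet implemented, this is only a sketch of the bookkeeping.

```python
from typing import Set, Tuple


def plan_delta(old_plan: Set[str], new_plan: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (replicas to deactivate, replicas to create) when switching plans."""
    to_deactivate = old_plan - new_plan   # actively replicated before, not any more
    to_activate = new_plan - old_plan     # newly selected for active replication
    return to_deactivate, to_activate


stop, start = plan_delta({"t1", "t2"}, {"t2", "t3"})
```

Tasks in both plans (here t2) keep their active replicas untouched, so only the differing tasks incur checkpoint transfer and catch-up cost.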

VI. Evaluation 

The experiments are run on the Amazon EC2 platform.
We build a cluster consisting of 36 instances, of which 35
m1.medium instances are used as the processing nodes and one
c1.xlarge instance is set as the Storm master node. Heartbeats
are used to detect node failures at 5-second intervals. The
recovery latency is calculated as the time interval between 
the moment that the failure is detected and the instant when 
the failed task is recovered to its processing progress before 
failure. The processing progress of a task is defined as a vector. 
Each field of the progress vector contains the sequence number 
of the latest processed tuple from a specific input stream of 
the task. A failed task is marked as recovered if the values of 
all the fields in its current progress vector are larger than or 
equal to the values of the corresponding fields of the progress 
vector before failure. Additional information of the experiment 
configuration will be presented in the following sections. 
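
The recovery condition on progress vectors can be written as a one-line check. In this sketch a progress vector is assumed to be a dictionary from input stream id to the sequence number of the latest processed tuple; the function name `is_recovered` is hypothetical.

```python
from typing import Dict


def is_recovered(current: Dict[str, int], before_failure: Dict[str, int]) -> bool:
    """True if every field of the current vector has reached its pre-failure value."""
    # A stream missing from the current vector counts as not yet processed.
    return all(current.get(stream, -1) >= seq
               for stream, seq in before_failure.items())


snapshot = {"s1": 10, "s2": 5}
caught_up = is_recovered({"s1": 10, "s2": 7}, snapshot)
still_behind = is_recovered({"s1": 9, "s2": 7}, snapshot)
```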

A. Recovery Efficiency

Fig. 6. Topology used in the experiments on recovery efficiency, shown at
the scale of operators.

In the first set of experiments, we study the recovery 
efficiencies of different fault-tolerance techniques, including 
checkpoint, which is used in Spark Streaming, source replay, 
which is the default fault-tolerance technique in Storm, and 
active replication. In Storm, if failure happens, the source data 
will be reprocessed from scratch through the whole topology 
to rebuild the states of the tasks. 

We implement a topology that consists of 1 source operator
and 4 synthetic operators. The structure of this topology is
depicted in Figure 6. The source operator consists of
16 tasks in total, which are evenly deployed on 4 nodes. All of
the source tasks produce input tuples for their downstream
neighboring tasks at a specified rate (1000 tuples/s or 2000
tuples/s). The degrees of parallelization of operators $O_1$, $O_2$,
$O_3$ and $O_4$ are set to 8, 4, 2 and 1, respectively. Each task
in $O_1$ receives inputs from two source tasks, and each task in
$O_2$, $O_3$ and $O_4$ receives inputs from two upstream neighboring
tasks. The primary replicas of the 15 synthetic tasks are evenly
distributed among 15 nodes. In addition, another
15 nodes are used as backup nodes to store the checkpoints
and to run the active replicas.

























Fig. 7. Recovery latency of single node failure (groups labelled by
input rate: 1000 tp/s and 2000 tp/s).

Fig. 8. Recovery latency of correlated failure (groups labelled by
input rate: 1000 tp/s and 2000 tp/s).

Fig. 9. Resource usage of maintaining checkpoints,
window length: 30 seconds.


Each of the four synthetic operators maintains a sliding
window whose sliding step is set to 1 second and whose window
interval varies from 10 seconds to 30 seconds. The state of
each task of a synthetic operator is composed of the input
data within the current window interval. The largest state size
of a task is equal to the input rate multiplied by the
window interval. The selectivity of the synthetic operators is
set to 0.5.
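
As a worked example of this bound, the short Python sketch below evaluates the largest state size for a few input rates and window intervals. The rates are illustrative assumptions: the actual per-task input rate in the experiments depends on the fan-in of the task and on the selectivity of its upstream operators.

```python
def max_state_size(input_rate_tps: int, window_interval_s: int) -> int:
    """Largest number of tuples held in a task's window state."""
    return input_rate_tps * window_interval_s


# Illustrative per-task input rates (tuples/s) and window intervals (seconds).
sizes = {
    (rate, window): max_state_size(rate, window)
    for rate in (1000, 2000)
    for window in (10, 30)
}
```

For instance, a task receiving 2000 tuples/s with a 30-second window holds at most 60,000 tuples, six times the state of a task at 1000 tuples/s with a 10-second window.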


Single Node Failure. Figure 7 presents the recovery latencies
of single node failures with various input rates and
window intervals using different fault-tolerance techniques.
For active replication, we vary the interval of trimming the
output buffer of a task replica, which determines the
frequency of synchronizing the replica with its primary task.
One can see that the active approach has much lower recovery
latency than the passive approaches, and that changes in window
intervals and input rates have little influence on it. On the other
hand, the recovery latencies with both Checkpoint and Storm
increase proportionally with the input rate, as a higher input
rate results in more tuples to be replayed during recovery
for both approaches. Furthermore, the recovery latency with
Checkpoint increases with the checkpoint interval. This is
because the number of tuples that need to be reprocessed to
recover the task state increases with the checkpoint interval.

As Storm has to replay more source data with longer
window intervals, one can see that the recovery latency of
Storm with 30-second windows is higher than that with
10-second windows. Another factor that influences the recovery
latency of Storm is the location of the failed task in the
topology, because the replayed tuples are processed by
all the tasks located between the tasks of the source operator
and the failed tasks. Thus the recovery latency of Storm is
higher than that of Checkpoint in most of the cases in this
experiment. Here, we record the recovery latencies of tasks
in different locations within the topology in Storm and report
their average values.

Correlated Failure. We inject a correlated failure by killing
all the nodes on which the primary replicas of the tasks are
deployed. In Figure 8, one can see that active replication
has much lower recovery latency than Checkpoint and Storm.
Furthermore, active replication with a shorter synchronization
period leads to faster failure recovery. This is because, with
a longer synchronization period, an active replica sends
more buffered tuples to its downstream tasks if its primary
fails. On the other hand, the recovery latency of Checkpoint
increases rapidly with the increase of input rate and checkpoint
interval. Storm has a lower recovery latency than
Checkpoint with a 30-second checkpoint interval. This is
because the window intervals in this set of experiments are
relatively short. In Storm, to build the window states, all the
source tuples belonging to the unfinished window instances
in the failed tasks are replayed, and their number increases
linearly with the window length. For recovery with
Checkpoint, in contrast, the number of tuples that must be
reprocessed to recover a failed task is at most the
input rate multiplied by the checkpoint interval.


By comparing the experimental results presented in Figure 7
and Figure 8, it can be seen that the recovery latency
with active replication is lower than that of the passive
approaches and is relatively stable under various input
rates and window intervals. Moreover, the benefits of using
active replication are larger in the case of correlated failure
than in the case of single node failure. This is because
some synchronization operations must be performed during the
checkpoint-based recovery of correlated failures.

The latency of failure recovery with checkpoints can be
reduced by setting a short checkpoint interval. However, the
resource usage of maintaining checkpoints varies with the
checkpoint interval. Figure 9 presents the ratio of the CPU
usage of maintaining checkpoints to that of normal computation
within a task. We can see that the CPU usage of maintaining
checkpoints increases quickly with shorter checkpoint intervals,
and making checkpoints at very short intervals, such as one
second, is prohibitively expensive. Although active replication
consumes more resources than the passive approach, the
low-latency recovery of active replication makes it meaningful in
the context of MPSPEs.

Fig. 10. Recovery latency of a correlated failure with PPA. Window length:
30 seconds. PPA-0.5-active indicates the recovery latency of actively replicated
tasks in plan PPA-0.5.

Recovery with PPA. We conducted experiments to study
the performance of PPA with three active replication plans,
denoted as PPA-1.0, PPA-0.5 and PPA-0, respectively. These
PPA plans consume different amounts of resources for active
replication. In PPA-1.0, all the tasks in the topology are
actively replicated. PPA-0.5 is a hybrid replication plan in which
only half of the tasks have active replicas. PPA-0 is a purely
passive replication plan in which all the tasks are replicated only
with checkpoints. The results are presented in Figure 10. As the
failed tasks with active replicas are recovered faster than
those using checkpoints, the overall recovery latency of
PPA-0.5 is higher than that of PPA-1.0 but lower than that of PPA-0.
Note that with PPA-0.5, the recovery latencies of tasks with
active replicas (denoted as PPA-0.5-active in Figure 10) are
much lower than that of recovering all the failed tasks (denoted
as PPA-0.5 in Figure 10). The recoveries of PPA-0.5-active
consume slightly less time than PPA-1.0 because the
number of actively replicated tasks recovered in PPA-0.5-active
is only half of that in PPA-1.0. This set of experiments
illustrates that the purely active replication plan outperforms the
hybrid and purely passive plans in terms of recovery latency.
With a hybrid plan, as the recoveries of actively replicated
tasks finish earlier than those of the passively replicated ones,
PPA can generate tentative outputs without waiting for the slow
recoveries of passively replicated tasks.

B. Tentative Output Quality

We implement two sliding-window queries whose inputs
are, respectively, from real and synthetic datasets. For each
query, we define an accuracy function based on its semantics.

$Q_1$ is a sliding-window query that calculates the top-100
hottest entries of the official website of World Cup 1998. The
input dataset is the server access log of the entire day
of June 30, 1998 [?], which consists of 73,291,868
access records in total. In the experiments, we replay the raw
input stream at a rate 48 times faster than the original data
rate. We implement this query as a topology that conducts
hierarchical aggregates, which is a common computation in
data stream applications. The structure of this topology is
depicted in Figure 11. Input tuples are partitioned to the tasks
in $O_1$ by their server ids. Tasks in $O_1$ split the input stream
into a set of consecutive slices, each consisting of 100 tuples,
and calculate their aggregate results. For every 100 input tuples,
tasks in $O_2$ conduct a merge computation and send the
results to the single task in $O_3$, which updates the
global top-100 entries for every 100 input tuples.

$Q_2$ is a sliding-window query that detects the traffic
incidents resulting in traffic jams. The window interval is 5
minutes and the sliding step is 10 seconds. As relevant datasets
for this query are not publicly available due to privacy
considerations, we generate a synthetic dataset for a
community-based navigation application. There are two streams in
this dataset: the user-location stream and the incident stream. The
rate of the user-location stream is set to 20,000 location
records per second. The incident stream is composed of
user-reported incident events, and the time interval between two
consecutive incidents is set to 2 seconds. We distribute 100,000
users among 1000 virtual road segments following the Zipfian
distribution (with parameter s = 0.5). The incident probability
of a segment is set to be proportional to the number of users
located on it. If an incident occurs on a segment, all the users
on this segment report an incident event. The topology
of $Q_2$ is presented in Figure 11. Tasks in $O_1$ receive the
user-location records and calculate the average speed of each
segment per second. Tasks in $O_2$ combine the user-reported
incident events into distinct incident events. $O_3$ joins the
segment-speed stream from $O_1$ and the distinct-incident stream
from $O_2$. The outputs of tasks in $O_3$ are the incidents that
incur traffic jams. $O_4$ aggregates the outputs of $O_3$.

Fig. 11. Top-k aggregate query ($Q_1$) and incident detection query ($Q_2$),
shown at the scale of operators.

Fig. 12. Comparing the values of OF/IC and the query accuracy. OF-SA-Accuracy
(or IC-SA-Accuracy) denotes the actual query accuracies of the PPA
plans generated using the structure-aware (SA) algorithm with OF (or IC) as
the optimization metric.



Fig. 13. Comparing the values of OF and the actual query accuracies of the
PPA plans generated by the dynamic programming algorithm (DP), the
structure-aware algorithm (SA) and the greedy algorithm (Greedy),
respectively. (a) Query: Q1. (b) Query: Q2.

Validation of the OF metric. In this set of experiments,
we examine whether OF can predict the actual quality of the
tentative output. We compare it with the Internal Completeness
(IC) metric proposed in [?], which measures the fraction of
the tuples that are expected to be processed by all the tasks
in case of failures compared to the case without failures. A
fundamental difference between OF and IC is that OF takes
the correlations of tasks' input streams into account.

By denoting the tentative outputs as $S_t$ and the accurate
outputs of $Q_1$ as $S_a$, we define the query accuracy of $Q_1$ as:
[equation omitted]. Figure 12(a) shows the OF (or IC) values and the
actual query accuracies of the PPA plans generated using the
OF (or IC) metric. The results show that both OF and IC provide
good predictions of the accuracy of typical top-k queries.
This is because both OF and IC provide accurate estimations
of the completeness of the inputs for aggregate queries, such
as top-k, and such queries' output accuracies highly depend on
the completeness of their inputs. The accuracy function of $Q_2$
is defined as [equation omitted], where $I_t$ is the set of tentative
incidents generated with correlated failure and $I_a$ is the set of
accurate incidents generated without failure. As shown in
Figure 12(b), the accuracy values are generally quite close to the
values of OF. On the other hand, with more available resources, we
can generate PPA plans with higher IC values. However, such
plans do not have higher query accuracies. This is because IC
fails to consider the correlation of tasks' input streams and
hence cannot provide a good accuracy prediction for queries
with joins. This result clearly indicates the importance of
distinguishing join operators in predicting output accuracies.

Fig. 14. Comparing OF of the SA and Greedy algorithms with random topologies of various specifications. The number of operators is set as a random integer between
5 and 10. (a): The workloads of tasks within an operator follow a uniform or Zipfian distribution (with parameter s = 0.1). (b): The degree of operator
parallelization is a random number within different ranges. (c): Topologies are either structured topologies or full topologies. (d): The fraction of join operators
in the topologies is set as 0 or 50%.

Comparing Various Algorithms. In this set of experiments,
we generate PPA plans for $Q_1$ and $Q_2$ using the dynamic
programming algorithm (DP), the structure-aware algorithm (SA)
and the greedy algorithm, respectively, and compare their
performances. The results presented in Figure 13 show that SA is
quite close to DP, which generates the optimal PPA plan, in
both OF and the actual query accuracy. Greedy has the worst
performance in the results of both queries. This is because
Greedy fails to consider that only complete MC-trees can
contribute to the query outputs.

C. Random synthetic topology 

To conduct a comprehensive performance study of PPA 
algorithms with various types of topologies, we implemented 
a random topology generator which can generate topologies 
with different specifications. In the experiments, for each set of 
topology specifications, we generate 100 synthetic topologies 
and use them as the inputs of the structure-aware algorithm 
and the greedy algorithm to compare their performances in 
terms of OF. Due to the prohibitive complexity of the dynamic
programming algorithm, we could not run it to completion for this
set of experiments within a reasonable time, so we do not include
it here. Query accuracies are not compared in this set of
experiments, as we cannot derive the actual output accuracies 
for these randomized synthetic topologies. 

In Figure 14, one can see that SA outperforms the greedy
algorithm in all the combinations of topology specifications
and active replication ratios. With a smaller replication ratio,
there is a greater difference between SA and the greedy 
algorithm. This is because the greedy algorithm is agnostic 
to the structure of the query topologies, and with a smaller 
replication ratio, there is smaller probability that the tasks 
selected by the greedy algorithms can form complete MC-trees 
that can contribute to the final output. 


Figure 14(a) depicts the effects of the workload skewness of
tasks within the operators. We can see that SA performs
better for topologies that have higher skewness of task
workloads. This is because, as the skewness of workloads
increases, the skewness of MC-trees' contributions to the value
of OF also increases and SA, by prioritizing tasks that are in
the MC-trees, achieves higher OF values. In Figure 14(b), we
report the results with varying parallelization degrees of the
operators. One can see that increasing the parallelization degree
also increases the value of OF, because a higher parallelization
degree slightly increases the skewness of the workloads of
the tasks in this set of experiments. As shown in Figure 14(c),
the OF values of structured topologies are generally higher than
those of full topologies. This is because, within an operator using
Full partitioning, the failure of any task reduces the input of
all the downstream tasks. For full topologies, the structure-aware
algorithm generates active replication plans in a similar
way to the greedy algorithm, so their performances
are close in this set of experiments. Figure 14(d) presents the
results with various fractions of operators being join operators.
For the same topology, OF decreases as more operators are set
as joins. This is because the loss of one input stream of a
join operator renders parts of the other (correlated) input
streams useless.

VII. Related Work 

Fault-tolerance in SPE. Traditional fault-tolerance techniques
for SPEs can be categorized as passive [?] and active [?]
approaches. The technique of delta checkpoints [?] is used to
reduce the size of checkpoints. The authors in [9] proposed
techniques to reduce the checkpoint overhead by minimizing the
sizes of queues between operators, which are part of the
checkpoints. [20] proposed to utilize the idle periods of the
processing nodes for active replication. Such optimizations are
compatible with our PPA scheme and can be employed in our system.

Spark Streaming [26] uses Resilient Distributed Datasets
(RDDs) to store the states of processing tasks. In case of
failure, RDDs can be restored from checkpoints or rebuilt by
re-executing the operations that were used to build them, based
on their lineage. In other words, it adopts both the
checkpoint-based and the replay-based approaches.

For other large-scale computing systems, such as MapReduce [?],
the overall job execution time is a critical metric.
However, for MPSPEs, it is the end-to-end latency of tuple
processing that matters, which makes low-latency failure
recovery an important feature in the context of MPSPEs. To
reduce recovery latency, the authors in [?], [26] proposed to use
parallel recovery and/or to integrate fault tolerance with
scale-out operations. In parallel recovery, multiple tasks can be
launched to recover a failed task, each of them recovering
a partition of the failed one, to shorten the process of passive
recovery. However, with a correlated failure, a large number
of failed tasks need to be recovered simultaneously. The
possibilities of fast scaling out and the degree of parallel
recovery would then be constrained.

Hybrid fault-tolerance approaches are proposed in Il25ll . 
GD. In 1^ . the objective is to minimize the total cost by 
choosing a passive fault-tolerance strategy, including upstream 
buffering, local checkpoint and remote checkpoint, for each 
operator. GD uses either active replication or checkpoint as 
the fault-tolerance approach for an operator. The optimization 
objective in IfTTlI is to minimize the total processing cost while 
satisfying the user-specified threshold of recovery latency, 
where only independent failure is considered. The work in lIZTl 
considers task overloading, referred to as “transient” failure, 
caused by temporary workload spikes. Upon a transient failure 
of a task, its active replica will be used to generate low-latency 
output. Unlike these approaches, our work addresses the 
trade-off between resource consumption and result accuracy 
under correlated failures. 

Tentative Outputs. Borealis [3] uses active replication for 
fault tolerance and allows users to trade result latency for 
accuracy while the system is recovering from a failure. More 
specifically, if a failed node has no live replica, Borealis will 
produce tentative outputs if the recovery cannot be finished 
within a user-defined interval. PPA adopts a similar mechanism 
for generating tentative outputs but goes further in optimizing 
the accuracy of tentative results. Previous work H attempts 
to dynamically assign computation resources between primary 
computation and active replicas to achieve trade-offs between 
system throughput and fault-tolerance guarantees. Its accuracy 
model, IC, does not consider the correlation of processing 
tasks’ input streams, which is shown to be inadequate in our 
experiments. The brute-force algorithm proposed in H 
has a high complexity, as our dynamic programming algorithm does. 
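
The timeout-based rule for tentative outputs described at the start of this paragraph can be written as a small decision function. This is a simplified sketch under stated assumptions: the `FailedTask` fields, the mode names, and the idea of comparing an estimated recovery time against the user-defined interval are illustrative choices, not the Borealis or PPA implementation.

```python
from dataclasses import dataclass


@dataclass
class FailedTask:
    """Hypothetical description of a failed task (names are assumptions)."""
    name: str
    has_live_replica: bool
    estimated_recovery_s: float


def output_mode(task: FailedTask, max_delay_s: float) -> str:
    """Decide how downstream results are produced while `task` is down."""
    if task.has_live_replica:
        return "failover"    # an active replica takes over; outputs stay accurate
    if task.estimated_recovery_s <= max_delay_s:
        return "wait"        # recovery finishes within the user-defined interval
    return "tentative"       # emit possibly inaccurate outputs until recovery completes


modes = [
    output_mode(FailedTask("join-1", True, 30.0), max_delay_s=5.0),
    output_mode(FailedTask("agg-2", False, 3.0), max_delay_s=5.0),
    output_mode(FailedTask("agg-3", False, 30.0), max_delay_s=5.0),
]
```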

A fault injection-based approach is presented in GSl to 
evaluate the importance of the computation units to the output 
accuracy, which only considers independent failures. Zen 
optimizes operator placement within clusters under a correlated 
failure model, which specifies the probability that a subset 
of the nodes fail together. The objective is to maximize 
the accuracy of tentative outputs after failures. As operator 
placement is orthogonal to the planning of active replications, 
their techniques can also be employed as a supplement to PPA. 

Failure in Clusters. Previous studies found that failure 
rates vary among different clusters and that the number of failures 
is in general proportional to the size of the cluster [23]. 
Correlated failures do exist and their scopes can be quite 
large Gi, im. Hence, considering correlated failures is 
inevitable for an MPSPE that supports low-latency and nonstop 
computations. 

VIII. Conclusion 

In this paper, we present a passive and partially active (PPA) 
fault-tolerance scheme for MPSPEs. In PPA, passive checkpoints 
are used to provide fault tolerance for all tasks, while 
active replication is applied only to selected tasks according 
to the availability of resources. A partially active replication 
plan is optimized to maximize the accuracy of tentative outputs 
during failure recovery. The experimental results indicate that, 
upon a correlated failure, PPA can start producing tentative 
outputs up to 10 times sooner than the recovery of all the 
failed tasks completes. Hence, PPA is suitable for applications that 
prefer tentative outputs with minimum delay. The experiments 
also show that our structure-aware algorithms can improve the 
quality of tentative outputs by up to one order of magnitude 
compared with the greedy algorithm, which is agnostic to the 
structure of the query topology, especially when limited 
resources are available for active replication. Therefore, to 
optimize PPA, it is critical to take advantage of knowledge 
of the query topology’s structure. 